Quantum-enhanced metrology infers an unknown quantity with accuracy beyond the standard
quantum limit (SQL). Feedback-based metrological techniques are promising for beating the SQL,
but devising the feedback procedures is difficult and inefficient. Here we introduce an efficient
self-learning swarm-intelligence algorithm for devising feedback-based quantum metrological
procedures. Our algorithm can be trained with simulated or real-world trials and accommodates
experimental imperfections, losses, and decoherence.



Precise metrology underpins modern science and engineering. However, the 'standard quantum
limit' (SQL) restricts achievable precision, beyond which measurement must be treated on a
quantum level. Quantum-enhanced metrology (QEM) aims to beat the SQL by exploiting entangled
or squeezed input states and a sophisticated detection strategy [1-3]. Feedback-based QEM is
most effective, as accumulated measurement data are exploited to maximize information gain in
subsequent measurements, but finding an optimal QEM policy for a given measurement device is
computationally intractable even for pure input states, unitary evolution $U$, and projective
measurements. Typically, policies have been devised by clever guessing [4, 5] or brute-force
numerical optimization [5]. Recently we introduced swarm-intelligence reinforcement learning to
devise optimal policies for measuring an interferometric phase shift [6]. That algorithm is
space efficient; i.e. the memory requirement is a polynomial function of the number of times
$N$ that $U$ is effected, in contrast to the exponentially expensive brute-force algorithm.
Although our result demonstrated the power of reinforcement learning, that algorithm requires a
run-time that is exponential in $N$ and a perfect interferometer, thereby effectively
restricting its applicability to proofs of principle. Here we report a space- and
time-efficient algorithm (based on new heuristics) for devising QEM policies. Our algorithm
works for noisy evolution and loss, thus making reinforcement learning viable for autonomous
design of feedback-based QEM in a real-world setting.

We restrict our focus to single-parameter QEM. Interferometric phase estimation is the
canonical quantum metrology problem and is applicable to measurements of time, displacements,
and imaging. Therefore, we develop and benchmark our algorithm for autonomous policy design in
this context. To beat the SQL, we employ an entangled sequence of $N$ input photons, feedback
control, and direct measurements of the interferometer output. For adaptive phase estimation,
the interferometer processes one photon at a time. Each input photon can be in two modes,
labeled $\{|0\rangle, |1\rangle\}$, corresponding to the interferometer's two paths. Thus, a
time-ordered sequence of $N$ photons implements an $N$-qubit state.

Figure 1. Adaptive feedback scheme for estimating an interferometric phase $\varphi$. The
input state $|\Psi_N\rangle$ is fed into the unital quantum channel $\mathcal{C}$ one qubit at
a time and the output qubit is measured or lost. The processing unit (PU) shifts the
interferometric phase by $\Phi$ after each successful measurement prior to processing the next
qubit.

We assume that the interferometric transformation (Fig. 1) can be expressed as a tensor
product of quantum channels (i.e. completely-positive trace-nonincreasing maps [7])
$\mathcal{C}(\varphi; \Phi_m)$ for $\varphi$ the unknown phase shift being estimated and
$\Phi_m$ a controllable phase with $m = 0, 1, \ldots, N-1$. The channel $\mathcal{C}$ is a
noisy version of the restrictive single-qubit unitary process $U$ normally considered in QEM.
Our tensor-product description corresponds to the assumption that the interferometric process,
other than the control, is unchanging during the measurement procedure. Photons of the
$N$-qubit input state $|\Psi_N\rangle$ enter the interferometer one by one and are transformed
by $\mathcal{C}$. Detectors measure where each photon exits, thereby implementing a
projective-valued measure with elements $\{|0\rangle\langle 0|, |1\rangle\langle 1|\}$ that
yield one bit $u \in \{0, 1\}$ if the photon is not lost. The processing unit (PU) modifies
the interferometric phase shift by $\Phi_m$, according to the measurement history
$h_m = u_m u_{m-1} \cdots u_1 \in \{0, 1\}^m$ up to the $m$th photon, prior to the next photon
being processed. After all $N$ input qubits have passed through the interferometer, the PU
estimates the interferometric phase shift $\varphi$ as $\tilde\varphi$. A policy $\varrho$ is
a 'behavior pattern' for the PU, i.e., a collection of rules that tell the PU how to set
$\Phi_m$ given $h_m$ and which phase estimate $\tilde\varphi$ to report at the end.

The error probability distribution $P(\varsigma|\varrho)$ of the policy $\varrho$ yields the
standard error $\Delta\tilde\varphi(\varrho)$ of the estimate $\tilde\varphi$ for
$\varsigma := \tilde\varphi - \varphi$. As $\varsigma$ is cyclic over $2\pi$,
$\Delta\tilde\varphi(\varrho)$ is given by the Holevo variance
$V_H(\varrho) = \Delta\tilde\varphi(\varrho)^2 := S(\varrho)^{-2} - 1$, for
$S(\varrho) := \left|\int_0^{2\pi} P(\varsigma|\varrho)\, e^{i\varsigma}\, d\varsigma\right|$
the sharpness of $P(\varsigma|\varrho)$ [8]. Evaluating $S(\varrho)$ requires exponential
computing time with respect to $N$ and thus is computationally intractable. However, from $K$
trial runs of $\varrho$ with randomly chosen phases $\varphi_1, \ldots, \varphi_K$, we can
infer a sharpness estimate $\tilde S := \left|\sum_{k=1}^{K} \exp(i\varsigma_k)\right| / K$
for $\varsigma_k$ the error of the $k$th phase estimate. For QEM,
$\Delta\tilde\varphi(\varrho)$ should scale better than the SQL
$\Delta\tilde\varphi \propto 1/\sqrt{N}$ and as close as possible to the ultimate Heisenberg
limit $\Delta\tilde\varphi \propto 1/N$ [1-3].
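
The sampled sharpness and the resulting Holevo variance are simple to compute once the errors of the individual trial runs are known. The following is a minimal sketch, not the authors' code: the function names and the synthetic Gaussian error data are assumptions made purely for illustration.

```python
import cmath
import math
import random

def sharpness_estimate(errors):
    """Sampled sharpness |sum_k exp(i * err_k)| / K for phase errors in radians."""
    if not errors:
        raise ValueError("need at least one trial run")
    return abs(sum(cmath.exp(1j * e) for e in errors)) / len(errors)

def holevo_variance(sharpness):
    """Holevo variance V_H = S**-2 - 1 for a sharpness 0 < S <= 1."""
    return sharpness ** -2 - 1.0

# Hypothetical data: K = 10000 estimation errors with a small Gaussian spread.
rng = random.Random(0)
errors = [rng.gauss(0.0, 0.1) for _ in range(10_000)]
s = sharpness_estimate(errors)
print(s, holevo_variance(s))
```

Because the errors enter only through $e^{i\varsigma_k}$, an error of $2\pi$ counts the same as no error, which is the reason for using the Holevo variance instead of the ordinary variance for a cyclic quantity.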

For unitary evolution, the interferometer transforms each input qubit by
$U_{\boldsymbol n}(\theta) = \exp\{-i\theta\, \boldsymbol\sigma \cdot \boldsymbol n\}$ for
$\boldsymbol\sigma := (\sigma_x, \sigma_y, \sigma_z)$ the Pauli matrices, $\boldsymbol n$ a
unit vector, and $\varphi - \Phi = 2\theta$ the interferometric phase difference. Without loss
of generality, we can restrict our analysis to $\boldsymbol n = (0, 1, 0)$. However, because
of imperfections, a real-world interferometer is represented by a non-unitary quantum channel
$\mathcal{C}$. We assume an unbiased interferometer, i.e. a random input qubit
$\mathbb{1} = |0\rangle\langle 0| + |1\rangle\langle 1|$ is mapped to itself
($\mathcal{C}(\mathbb{1}) = \mathbb{1}$), corresponding to a unital channel. Hence, for
continuous or discrete and countable $\boldsymbol n$ and $\theta$ [9],

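$$\mathcal{C}(\bullet) = \sum\!\!\!\!\!\!\!\int_{\boldsymbol n,\theta} W_{\boldsymbol n}(\theta)\, U_{\boldsymbol n}(\theta) \bullet U_{\boldsymbol n}^{\dagger}(\theta), \qquad W_{\boldsymbol n}(\theta) \in \mathbb{R}, \tag{1}$$

with $\sum\!\!\!\!\!\!\int_{\boldsymbol n,\theta} W_{\boldsymbol n}(\theta) = 1$ and
$W_{\boldsymbol n}(\theta) = \delta_{\theta,(\varphi-\Phi)/2}\, \delta_{\boldsymbol n,(0,1,0)}$
for an ideal interferometer. In contrast,
$\sum\!\!\!\!\!\!\int_{\boldsymbol n,\theta} W_{\boldsymbol n}(\theta) = 1 - \eta$ corresponds
to an input state-independent loss rate $\eta$, and quantum noise is incorporated by
$W_{\boldsymbol n}(\theta)$ being a general distribution with
$\langle\theta\rangle = (\varphi - \Phi)/2$ and $\langle\boldsymbol n\rangle = (0, 1, 0)$.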
We simulate noise using normal distributions with the aforementioned means and small standard
deviations $\sigma_\theta, \sigma_{\boldsymbol n} \ll 1$, corresponding to visibility
$1/(2e^{\sigma_\theta^2} - 1)$. For an optical interferometer, $\theta$ noise corresponds to
path-length difference fluctuations and $\boldsymbol n$ noise to beam splitter reflectivity
fluctuations. We utilize the input state $|\Psi_N\rangle$ from [4-6], which is defined in
terms of Wigner's $d$-matrix [10] and the states $|n\rangle_{[N]}$, where $|n\rangle_{[N]}$ is
a permutationally-symmetric state with $n$ qubits in $|1\rangle$ and $N-n$ in $|0\rangle$
[11]. The state $|\Psi_N\rangle$ is appealing because it allows precision close to the
Heisenberg limit [4, 5] and is robust against loss [6], but our learning methods work for
other states as well.
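
To make the noise and loss model of Eq. (1) concrete, here is a minimal Monte Carlo sketch of a single use of the channel. It is a simplification and not the simulation used for the results: it sends one unentangled qubit prepared in $|0\rangle$ instead of the entangled $N$-qubit input state, and the function name, the parameter names, and the renormalization of the sampled axis to unit length are assumptions of this sketch.

```python
import math
import random

def measure_single_qubit(phi, feedback, sigma_theta=0.0, sigma_n=0.0,
                         loss=0.0, rng=random):
    """Send |0> through one noisy, lossy channel use; return 0, 1, or None if lost."""
    if rng.random() < loss:
        return None  # photon lost: no measurement result
    # The rotation angle theta fluctuates around (phi - feedback) / 2.
    theta = rng.gauss((phi - feedback) / 2.0, sigma_theta)
    # The rotation axis fluctuates around (0, 1, 0); renormalizing it to a
    # unit vector is an assumption of this sketch.
    nx, ny, nz = (rng.gauss(mu, sigma_n) for mu in (0.0, 1.0, 0.0))
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    nx, ny, nz = nx / norm, ny / norm, nz / norm
    # U = cos(theta) 1 - i sin(theta) (n . sigma) acting on |0> gives
    # P(outcome 1) = sin(theta)^2 * (nx^2 + ny^2).
    p_one = math.sin(theta) ** 2 * (nx * nx + ny * ny)
    return 1 if rng.random() < p_one else 0

# Example: fraction of detected photons that exit in mode 1 at phase difference pi/2.
rng = random.Random(1)
outcomes = [measure_single_qubit(math.pi / 2, 0.0, sigma_theta=0.02 * math.pi,
                                 sigma_n=0.004 * math.pi, loss=0.05, rng=rng)
            for _ in range(20_000)]
detected = [u for u in outcomes if u is not None]
print(len(detected) / len(outcomes), sum(detected) / len(detected))
```

With noise and loss switched off, the sketch reduces to the ideal interferometer: a phase difference of $0$ always yields outcome 0 and a phase difference of $\pi$ always yields outcome 1.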

The control flow graph of any deterministic policy for lossless conditions and a fixed
$N$-qubit input state can be represented as a binary decision tree of depth $N$, with an
example shown in Fig. 2(a). Each of the $\sum_{k=0}^{N} 2^k = 2^{N+1} - 1$ nodes of the tree
corresponds to one specific state of the experiment and represents the resultant action of the
policy. Numeric optimization is computationally intractable due to the exponentially large
number of nodes. Therefore, we restrict our search to policies that implement a 'generalized
logarithmic search' (GLS) heuristic as described below, because the set of all GLS policies
can be parametrized by only $N$ parameters and contains phase estimation policies with optimal
precision scaling [6] with respect to $N$.

Figure 2. (a) Decision tree representation of a GLS policy for $N = 2$ (solid) and $N = 3$
(entire tree). For each path in the tree, the inner nodes represent the applied feedback
phases $\Phi_m$ and the leaf shows the final phase estimate $\tilde\varphi$. At depth $m$, a
measurement $u_{m+1} = 0$ directs the path to the left and $u_{m+1} = 1$ to the right.
(b) Embedding the best policy in $\mathcal{P}_N$ into the policy space $\mathcal{P}_{N+1}$,
shown for $N = 2$. From the best two-qubit policy, the policy $\varrho_3 \in \mathcal{P}_3$ is
generated as a guideline. The initial candidate policies for three input qubits are chosen
according to probability density (3), indicated by the shaded area around $\varrho_3$. (For
clarity, the $N = 2$ case is depicted, although only candidate policies for $N > 10$ are
chosen according to (3).)

For a uniform prior of $\varphi \in [0, 2\pi)$, the GLS heuristic commences with the initial
feedback $\Phi_0 = 0$. After the $m$th measurement result $u_m \in \{0, 1\}$, the feedback
phase is $\Phi_m = \Phi_{m-1} - (-1)^{u_m} \Delta_m$. If the qubit is lost, $\Phi$ remains
unchanged. After all $N$ input qubits are processed, there are $M \le N$ measurement results
$u_M, \ldots, u_1$, and the GLS heuristic reports the phase estimate
$\tilde\varphi = \Phi_{M-1} - (-1)^{u_M} \Delta_M$. According to this parametrization, every
GLS policy for an $N$-qubit input state is represented by a vector
$\varrho = (\Delta_1, \ldots, \Delta_N)$ in the policy space $\mathcal{P}_N = [-\pi, \pi)^N$,
and any such vector $\varrho \in \mathcal{P}_N$ is a valid policy. As any policy
$\varrho \in \mathcal{P}_N$ utilizes a string of $N$ input qubits, we refer to it as an
$N$-qubit policy. Every $\varrho \in \mathcal{P}_N$ implements a GLS because $\varrho$ has
variable entries compared to logarithmic search (LS), for which
$\Delta_m = \tfrac{1}{2}\Delta_{m-1}$ [12]. The $N$-qubit LS policy
$(\pi/2, \pi/4, \ldots, \pi/2^N)$ is an element of $\mathcal{P}_N$ but does not surpass the
SQL. The duality between GLS policies and points in $\mathcal{P}_N$ allows the use of function
optimization techniques to search for an optimal $\varrho_{\rm opt} \in \mathcal{P}_N$ with
minimum $\Delta\tilde\varphi$, i.e.
$\varrho_{\rm opt} \in \arg\min_{\varrho \in \mathcal{P}_N} V_H(\varrho) = \arg\max_{\varrho \in \mathcal{P}_N} S(\varrho)$.
Unfortunately, this optimization problem is non-convex and hence difficult [6].
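
The GLS feedback rule can be written as a short loop over the measurement record. The sketch below is one reading of the rule and not the authors' implementation: marking a lost qubit with `None`, applying the step sizes in the order of the successful detections, and reporting the estimate modulo $2\pi$ are assumptions of this sketch.

```python
import math

def gls_estimate(policy, results):
    """Run a GLS policy (Delta_1, ..., Delta_N) on a record of measurement results.

    `results` holds one entry per input qubit: 0 or 1 for a detected photon and
    None for a lost one (the None convention is an assumption of this sketch).
    """
    feedback = 0.0   # initial feedback Phi_0 = 0
    detected = 0     # number of successful measurements so far
    for u in results:
        if u is None:
            continue  # lost qubit: the feedback phase stays unchanged
        # Phi_m = Phi_{m-1} - (-1)^{u_m} * Delta_m
        feedback -= (-1) ** u * policy[detected]
        detected += 1
    # After the last detected qubit, the current feedback is the phase estimate.
    return feedback % (2.0 * math.pi)

def logarithmic_search_policy(n):
    """The LS policy (pi/2, pi/4, ..., pi/2^n) as a point in the policy space."""
    return [math.pi / 2 ** m for m in range(1, n + 1)]

ls = logarithmic_search_policy(3)
print(gls_estimate(ls, [1, None, 0]))  # only Delta_1 and Delta_2 are used
```

A policy is therefore fully specified by the list of $N$ step sizes, which is what makes the search space $\mathcal{P}_N$ small compared to the full decision tree.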

Particle swarm optimization (PSO) algorithms [13, 14] are outstandingly successful for
non-convex optimization. PSO is a 'collective intelligence' strategy from the field of machine
learning that learns via trial-and-error and performs as well as or better than simulated
annealing and genetic algorithms [15-17]. We have shown that PSO also delivers an autonomous
approach to devising adaptive phase-estimation policies for ideal interferometry [6, 18].

To search for $\varrho_{\rm opt}$, the PSO algorithm models a 'swarm' of $\Xi$ 'particles'
$\{p^{(1)}, p^{(2)}, \ldots, p^{(\Xi)}\}$ that move in the search space $\mathcal{P}_N$. A
particle's position $\varrho^{(i)} \in \mathcal{P}_N$ represents a candidate policy for
estimating $\varphi$, which is initially chosen at random. Furthermore, $p^{(i)}$ remembers
the best position, $\hat\varrho^{(i)}$, it has visited so far (including its current
position). In addition, $p^{(i)}$ communicates with other particles in its neighborhood
$\mathcal{N}^{(i)} \subset \{1, 2, \ldots, \Xi\}$. We adopt the common approach to set each
$\mathcal{N}^{(i)}$ in a pre-defined way regardless of the particles' positions by arranging
them in a ring topology: for $p^{(i)}$, all particles with maximum distance $r$ on the ring
are in $\mathcal{N}^{(i)}$. In each iteration, the PSO algorithm updates the position of all
particles in a round-based manner as follows.

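(i) Each particle $p^{(i)}$ samples the sharpness $\tilde S(\varrho^{(i)})$ of its current
position with $K$ trial runs.

(ii) $p^{(i)}$ re-samples the sharpness $\tilde S(\hat\varrho^{(i)})$ of its personal-best
policy $\hat\varrho^{(i)}$, and the performance of $\hat\varrho^{(i)}$ is taken to be the
arithmetic mean $\bar S(\hat\varrho^{(i)})$ of all sharpness evaluations.

(iii) Each $p^{(i)}$ updates $\hat\varrho^{(i)}$ to $\varrho^{(i)}$ if
$\tilde S(\varrho^{(i)}) > \bar S(\hat\varrho^{(i)})$ and

(iv) communicates $\hat\varrho^{(i)}$ and $\bar S(\hat\varrho^{(i)})$ to all members of
$\mathcal{N}^{(i)}$.

(v) Each particle $p^{(i)}$ determines the sharpest policy
$\Lambda^{(i)} = \max_{j \in \mathcal{N}^{(i)}} \hat\varrho^{(j)}$ (the maximum being taken
with respect to sharpness) found so far by any one particle in $\mathcal{N}^{(i)}$ (including
itself) and

(vi) moves to a new position according to the update rule of Eq. (2).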

The arrows indicate that the right value is assigned to the left variable. The damping factor
$\omega$ assists convergence, and $\xi_1, \xi_2$ are uniformly-distributed random numbers from
the interval $[0, 1]$ that are re-generated each time Eq. (2) is evaluated. The 'exploitation
weight' $\beta_1$ parametrizes the attraction of a particle to its personal best position
$\hat\varrho^{(i)}$, and the 'exploration weight' $\beta_2$ describes attraction to the best
position $\Lambda^{(i)}$ in the neighborhood. To improve convergence, we bound each component
of the update step $\omega\delta^{(i)}$ by a maximum value of $v_{\max}$. The user-specified
parameters $\omega, \beta_1, \beta_2$, and $v_{\max}$ determine the swarm's behavior. Tests
indicate that $\omega = 0.8$, $\beta_1 = 0.5$, $\beta_2 = 1$, and $v_{\max} = 0.2$ result in
the highest probability to find an optimal policy.
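
The move in step (vi) and the ring neighborhood can be illustrated with the textbook PSO velocity update. This is a sketch under stated assumptions and is not claimed to be the exact form of Eq. (2): the velocity-style update, the per-component random numbers, and the function names `ring_neighborhood` and `pso_step` are choices of this sketch. Only the parameter values are taken from the text.

```python
import random

def ring_neighborhood(i, swarm_size, radius):
    """Indices of all particles within ring distance `radius` of particle i (incl. i)."""
    return sorted({(i + d) % swarm_size for d in range(-radius, radius + 1)})

def pso_step(position, velocity, personal_best, neighborhood_best,
             omega=0.8, beta1=0.5, beta2=1.0, v_max=0.2, rng=random):
    """One textbook PSO move; returns the new position and velocity as lists."""
    new_position, new_velocity = [], []
    for x, v, pb, nb in zip(position, velocity, personal_best, neighborhood_best):
        xi1, xi2 = rng.random(), rng.random()  # fresh uniform numbers in [0, 1)
        # Damped old velocity plus attraction to the personal and neighborhood bests.
        v = omega * v + beta1 * xi1 * (pb - x) + beta2 * xi2 * (nb - x)
        v = max(-v_max, min(v_max, v))         # bound each component of the step
        new_velocity.append(v)
        new_position.append(x + v)
    return new_position, new_velocity

print(ring_neighborhood(0, 5, 1))
print(pso_step([0.0], [10.0], [0.0], [0.0]))
```

The sketch leaves out the sharpness sampling of steps (i) to (v) and does not wrap positions back into $[-\pi, \pi)$; both would be needed in a complete optimizer.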

The $K$ trial runs for assessing sharpness can be simulated or performed with a real-world
experiment. For finite $K$, the sampled sharpness has statistical errors that can prevent the
PSO algorithm from learning optimal solutions [19]. We reduce sharpness errors by averaging
over multiple samples in step (ii) [20]. However, for $N > 12$, the PSO algorithm fails to
learn good policies from scratch due to sharpness errors [18]. Therefore, we maintain our
earlier strategy of running the learning algorithm for each $N$ independently when
$N \le 10$. For $N > 10$, our new heuristic bootstraps a starting point for the optimization
of an $N$-qubit policy from the best $(N-1)$-qubit policy
$\varrho' = (\Delta'_1, \ldots, \Delta'_{N-1})$. Our heuristic exploits the fact that an
$(N-1)$-qubit policy can be used as an $N$-qubit policy by ignoring the $N$th measurement
result. For $N > 10$, the optimal $(N-1)$-qubit policy estimates phases with only 10% less
accuracy compared to an optimal $N$-qubit policy when used with the $N$-qubit input
$|\Psi_N\rangle$ [21]. Furthermore, the performance difference between the optimal $N$-qubit
policy and the $(N-1)$-qubit policy decreases with increasing $N$ because the relative change
in qubit number decreases with increasing $N$. Therefore, a good $(N-1)$-qubit policy is a
valuable starting point for optimizing an $N$-qubit policy.

Utilizing previously learned policies is done at the initialization step of the PSO algorithm.
The initial policy $\varrho \in \mathcal{P}_N$ is selected as the particle's starting position
with the probability density of Eq. (3), which is built from truncated normal distributions
$\mathcal{N}_{\mu,\sigma}(x)$. See Fig. 2(b) for an illustration of this strategy. The
standard deviation $\sigma_1$ determines the similarity of the first $N-1$ actions of the
newly generated policies compared to the template policy $\varrho'$; $\sigma_2$ determines the
extent to which the action for the new $N$th qubit agrees with the previous action of
$\varrho'$. We found that $\sigma_1 = 0.01\pi$ and $\sigma_2 = 0.25\pi$ yield a high success
rate for our PSO heuristic.
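
A starting policy of this kind can be drawn by perturbing the template entry by entry. The sketch below is an illustration under assumptions and not a transcription of Eq. (3): it centers the new $N$th entry on the last entry of the template, truncates every normal distribution to $[-\pi, \pi)$ by rejection sampling, and uses the hypothetical names `truncated_normal` and `initial_policy`.

```python
import math
import random

def truncated_normal(mean, sigma, low=-math.pi, high=math.pi, rng=random):
    """Rejection-sample a normal variate restricted to the interval [low, high)."""
    while True:
        x = rng.gauss(mean, sigma)
        if low <= x < high:
            return x

def initial_policy(template, sigma1=0.01 * math.pi, sigma2=0.25 * math.pi, rng=random):
    """Draw an N-qubit starting policy from an (N-1)-qubit template policy."""
    # First N-1 entries: stay close to the template (narrow spread sigma1).
    first = [truncated_normal(delta, sigma1, rng=rng) for delta in template]
    # New N-th entry: assumed to be centered on the template's last entry,
    # with the broader spread sigma2.
    last = truncated_normal(template[-1], sigma2, rng=rng)
    return first + [last]

template = [1.0, 0.5, 0.25]  # hypothetical 3-qubit policy
print(initial_policy(template, rng=random.Random(1)))
```

Calling `initial_policy` once per particle gives a swarm that starts in a small region around the embedded template, as indicated by the shaded area in Fig. 2(b).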

For $4 < N < 14$ and perfect interferometry, we verified that our new PSO algorithm with swarm
size $\Xi = 20N$ learns optimal $N$-qubit policies regardless of whether each policy's
sharpness is evaluated exactly (requires time $\propto 2^N$) or sampled from $K = 10N^2$ trial
runs (requires polynomial runtime in $N$ when simulated). Therefore, we sample the sharpness
of each particle's current position and personal best position in each PSO iteration. As we
run the PSO algorithm for a constant 300 iterations, the entire optimization process requires
$O(K\Xi)$ trials. However, to obtain an $N$-qubit policy, we have to optimize policies for
$10, 11, \ldots, N-1$ input qubits beforehand, as our algorithm requires an $(N-1)$-qubit
policy for devising an $N$-qubit policy for any $N > 10$. Therefore, learning an $N$-qubit
policy requires $O(NK\Xi) = O(N^4)$ trial runs. When the trials are simulated, the
computational complexity of our PSO heuristic is $O(N^6)$ (hence efficient) as a single trial
run can be simulated in time $O(N^2)$ [11]. Once learned, the execution of an $N$-qubit policy
requires $N$ entangled input qubits.
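
The trial-run count can be checked with a few lines of arithmetic. The sketch below assumes a swarm size of $20N$, $10N^2$ trial runs per sharpness sample, two sharpness samples per particle and iteration (current and personal-best position), and a chain of optimizations starting at 10 qubits; the function name `trial_budget` is hypothetical.

```python
def trial_budget(n, iterations=300, first_size=10):
    """Count the trial runs needed to reach an n-qubit policy via bootstrapping."""
    total = 0
    for size in range(first_size, n + 1):
        swarm = 20 * size       # swarm size grows linearly with the qubit number
        runs = 10 * size ** 2   # trial runs per sharpness sample
        # Each particle samples its current and its personal-best position.
        total += iterations * swarm * 2 * runs
    return total

for n in (10, 20, 40, 80):
    print(n, trial_budget(n))
```

Each optimization in the chain costs a number of trials proportional to the cube of its qubit number, so summing over the chain gives the quartic growth in trial runs; doubling a large $N$ multiplies the budget by roughly 16.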

Figure 3. Holevo phase variance $V_H$ of PSO-optimized policies compared to other schemes vs.
the number of input qubits $N$ for (a) $|\Psi_N\rangle$ and (b) $|0 \cdots 00\rangle$ as input
states, respectively. The dashed line shows the SQL. Due to limited computational resources,
some simulations are carried out only to $N < 25$. (Loss rate $\eta$ is in percent;
$\sigma_{n_x} = \sigma_{n_y} = \sigma_{n_z} = 0.2\sigma_\theta$.)

We trained our PSO algorithm with simulated trial runs for various noise and loss rates. In
each case, our PSO algorithm tries to find the sharpest policy for given $N$. As the algorithm
uses stochastic optimization, it is not guaranteed to learn the optimal policy every time and
must be run several times independently for each $N$. Nevertheless, within the limits of
available computational resources, the PSO algorithm succeeded in at least 25% of the runs,
independently of $N$. We compared the policies generated by our new machine-learning algorithm
to our previous numerically-optimized policies [6], the Berry-Wiseman (BW) policy [4], and
policies obtained by brute-force numerical optimization [5].

We first discuss policies for a noiseless, lossless setup, i.e., for unitary evolution.
Fig. 3(a) shows that our new method, tested to the limits of available computational
resources, outperforms the BW policy. We estimate the performance difference by calculating
the scaling $\alpha$ of the Holevo variance $V_H$. Our policies yield
$V_H \propto N^{-\alpha}$ with $\alpha_{\rm PSO} = 1.494 \pm 0.003$, compared to the inferred
scaling $\alpha_{\rm BW} = 1.415 \pm 0.003$ for $N < 50$. Furthermore, our new efficient
method greatly surpasses our previous optimization scheme [6] by more than tripling the domain
of $N$ for developing policies while maintaining the same precision. The inefficient
brute-force optimization was carried out in the full policy space, i.e. without restriction to
GLS policies. However, the resulting globally optimal policies perform better only by a
constant factor of $0.88 \pm 0.01$ compared to our PSO-optimized policies and do not yield
better scaling $\alpha$. As expected, the PSO algorithm yields policies approaching the SQL
$V_H \propto 1/N$ for separable input states (Fig. 3(b)) [1-3].
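
A scaling exponent of this kind can be inferred from variance data by a straight-line fit on a log-log scale. The sketch below shows one common way to do this and is not necessarily the fitting procedure behind the quoted values; the function name and the synthetic data are assumptions made for illustration.

```python
import math

def fit_power_law_exponent(ns, variances):
    """Least-squares slope of log V_H versus log N; returns alpha for V_H ~ N**(-alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in variances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic data that follow an exact power law with exponent 1.5.
ns = list(range(4, 51))
variances = [3.0 * n ** -1.5 for n in ns]
print(fit_power_law_exponent(ns, variances))
```

On this scale the SQL corresponds to $\alpha = 1$ and the Heisenberg limit to $\alpha = 2$, so the fitted exponent directly measures how far a family of policies moves from one limit toward the other.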

Our new algorithm delivers the first QEM policies optimized for a simulated imperfect
interferometer with loss and Gaussian quantum noise. When applied to noisy conditions,
policies generated by our new algorithm perform significantly better than policies optimized
for perfect interferometry. As expected, the performance difference increases with the noise
level [22]. We verify that our algorithm successfully devises superior policies also for
non-Gaussian noise by using skew-normal distributions with skewness $\gamma = 0.667$ for the
distributions of $\theta$ and $\boldsymbol n$ [23]. We find that a nonzero third standardized
moment with variances kept as before does not reduce the performance of the policies learned
by our new PSO algorithm [24].

In summary, we have devised an efficient machine-learning algorithm to construct
adaptive-feedback measurement policies autonomously for time-independent, single-parameter
estimation problems. Our one prerequisite is a training-phase comparison criterion to evaluate
the success of candidate policies. Within the limits of available computational resources, our
PSO-generated policies outperform all known schemes for adaptive single-shot phase estimation
with direct measurement of the channel output. Our algorithm learns to account for
experimental errors and loss, thereby making time-consuming error modeling and extensive
calibration dispensable.
Appendix




Figure 4. Holevo phase variance $V_H$ of PSO optimized GLS policies. Purple crosses (×):
$N$-qubit policy used with input state $|\Psi_N\rangle$. Brown pluses (+): $(N-1)$-qubit
policy used with input state $|\Psi_N\rangle$ (the last measurement result is ignored by the
policy).

Figure 5. Holevo phase variance $V_H$ of policies optimized for simulated Gaussian quantum
noise ( ) and skew-normal quantum noise with skewness $\gamma = 0.667$ (•). In both cases, we
used the standard deviations $\sigma_\theta = 0.02\pi$ and
$\sigma_{n_x} = \sigma_{n_y} = \sigma_{n_z} = 0.2\sigma_\theta$.
Figure 6. Holevo phase variance $V_H$ of policies from [6] that are optimized for a perfect interferometer (○), compared to
the policies optimized by our new algorithm for the specific imperfections (•). The performance of the policies is evaluated
for Gaussian quantum noise with standard deviations $\sigma_\theta$ and $\sigma_{\boldsymbol n} = (\varepsilon, \sqrt{1 - \varepsilon})$, $\varepsilon = 0.2\sigma_\theta$. (a) For low noise ($\eta = 5\%$ and
$\sigma_\theta = 0.02\pi$), there is no noticeable performance enhancement. (b) For larger noise ($\eta = 5\%$ and $\sigma_\theta = 0.1\pi$), the policies optimized
for perfect interferometry have a performance scaling $V_H \propto N^{-\alpha_1}$ with $\alpha_1 = 1.162 \pm 0.003$. In contrast, the policies optimized for
the aforementioned noise and loss achieve a scaling of $\alpha_2 = 1.236 \pm 0.003$.



